# MICROPROCESSOR AND COMPUTER ARCHITECTURE UNIT-5 advances in architecture

feedback/corrections: vibha@pesu.pes.edu

VIBHA MASTI

# ADVANCED ARCHITECTURE

- · M1 system on chip, ARM CRISC) architecture, levels of cache
- · cache levels exist even for soc
- · 8 cores & GPUs on chip
- · Multicore, multiprocessing, parallel computing

High Performance Computing

- Efficient algorithms on computers
  Capable of highest performance
  To solve most demanding problems
- D Higher Speed
- 2) Higher Throughput
- 3) High computational Power

Quantifying Capability to Solve Problem

# · Floating Point Operations Per Second

| Name       | Unit   | Value           |
|------------|--------|-----------------|
| kiloFLOPS  | kFLOPS | 10 <sup>3</sup> |
| megaFLOPS  | MFLOPS | 10              |
| gigaFLOPS  | GFLOPS | 10 <sup>9</sup> |
| teraFLOPS  | TFLOPS | 1012            |
| petaFLOPS  | PFLOPS | 1015            |
| exaFLOPS   | EFLOPS | 1018            |
| zettaFLOPS | ZFLOPS | 1021            |
| yottaFLOPS | YFLOPS | 1024            |

Jineline of Growth

Blue Gene



Super Scalar/Vector/Parallel

# Moore's Law

- · No. of transistors/chip doubles every 2 years
- · Cost of computer halved



Q: In 1988, the number of transistors in the Intel 386 SX microprocessor was 275,000. What were the transistors counts of the Pentium II Intel microprocessor in 1997?



P. = 275000

 $P_n = 275000 \times 2 = 6222539 = 6.2 \text{ million}$ 

Growth & Change in Trend



# Shift to Parallel Processing

# i) Memory wall challenge

- · Gap between processor and memory performances
- · Gap increasing
- · Memory latency and bandwidth insufficient; acts as bottleneck
- · Processors stall

# 2) Power Wall Challenge

- · Power delivery and dissipation
- · Difficulty in scaling performance of chips and systems
- · Faster computers get very hot

# AMDAHL'S LAW\*

speedup = 
$$\frac{1}{(1-f) + \frac{f}{s}}$$

S= speed up factor

- f = fraction of program that can be optimised
- 1-f = fraction of program that cannot be optimised

\* more on page 39

# Parallel computing

# Multiple CPUs for single program



### von Neumann Architecture

- · Sequence of instructions stored in memory
- · Executed sequentially
- Stored-program concept : sequence of read-write operations on memory
- · Data and instructions both stored in a single memory



Von Neumann Architecture

#### Harvard Architecture

- · Separate storage and buses for instructions and data (modern)
- · Can fetch and load/store at the same time (pipelining)



Harvard Architecture

Shift TO Parallel Computing

- 1. Bit-Level Parallelism
  - · 8-bit processor to add 16-bit numbers

#### 2. Instruction-Level Parallelism

Pipelining (different stages)

#### 3. Loop-Level Parallelism

- No dependency split across cores
- · Loop unrolling for dependency

# 

| load  | x (0)        |            |
|-------|--------------|------------|
| load  | y[0]         | 1000 times |
| add   | x [0], y [0] | l stall    |
| store | 2C0]         |            |

# IF TO TO TO ME TO WB

# IF TO TO LE TO ME TO WB

# \* IF +> ID +> IE +> ME +> WB

# IF ID IE ME WB

| Loop Uni   | rolling |              |
|------------|---------|--------------|
|            | 1 load  | x (o)        |
|            | load    | yloJ         |
| reschedule | add     | x [0], y[0]  |
|            | store   | 260]         |
|            | load    | <b>x(</b> 1) |
|            | load    | yCIJ         |
|            | add     | x[1], y[1]   |
|            | 🖌 store | 2613         |

### 4. Thread-Level Parallelism

· fine-grained thread & coarse-grained thread

core

- execution state of program (Reg, PC, stack pointer)
- interrupt logic
- execution units
- cache
- single threaded processor



# Multicore

- multiple cores on single die / chip
- chip Multiprocessor CCMP)
- thread level is task level parallelism
- each core independently executes a task (or thread)
- cores can share resources
- multithreading and multiprocessing supported

| CPU State             | CPU State                      |
|-----------------------|--------------------------------|
| Interrupt Logic       | Interrupt Logic                |
| Execution Cache Units | Execution Cache<br>Units Cache |

#### Hyperthreading

- simultaneous multi threading (SMT)
- thread defined by architecture state (interrupt logic, reg)
- shared execution units and cache
- Intel



- Amdahl's law for HT

#### (a) Fine-Grained Thread

- · different parts of program run parallely
- programmers explicitly specify parts of program to run parallely
- · hardware extracts parallelism and dynamically schedules
- · compiler dynamically schedules

#### (b) Coarse - Grained Threads

· OS responsible for scheduling tasks on different cores

#### 5. Task-Level Parallelism

- · Os or programmer
- · processes, tacks, jobs

# CLASSIFICATION of PARALLEL COMPUTERS -

- · Flynn's Taxonomy of Computer Architecture
- · Two independent dimensions: instructions and data



# (1) SISD : Single Instruction, Single Data

- Single instruction: only one instruction stream being accessed by CPU during single clock cycle
- Single data: only one data stream being accessed by CPU during single clock cycle
- · Deterministic
- Intel Atom Family (Silverthorne, Lincroft, Diamondville, Pineview) — rarely found
- · Older; sequential execution



# (2) SIMD: Single Instruction, Multiple Data

- Specialised problems with high degree of regularity (eg: image processing)
- · Two varieties: processor arrays & vector pipelines
- Processor Arrays: Connection Machine CM-2, Maspar MP-1, MP-2
   Vector Pipelines: IBM 9000, Cray C90, Fujitsu VP, NEC SX-2

each core on diff mult from matrix



# Array Processor



#### (3) MIMD : Multiple Instruction, Multiple Data

- · Most common type of parallel computer
- Synchronous or asynchronous, deterministic or non-deterministic execution



# (4) MISD: Multiple Instruction, Single Data

- Few exist in real life Cexperimental one at (MU); practical purposes — does not exist
- · Single data stream fed into multiple processing units
- · Systolic arrays



# MODERN CLASSIFICATION

· Parallelism can be achieved in two ways

#### (1) Data Parallelism

- operating on multiple data in parallel

#### (2) Function Parallelism

- performing many functions in parallel (control parallelism, task parallelism)

### DATA PARALLELISM



# FUNCTIONAL PARALLELISM

 Functional programming: big task broken into smaller tasks that are independently executed in parallel and later combined to give the final result



# UNIX Process

- · consists of address space, large set of process state values, one thread of execution
- Task of kernel: create processes and dispatch them to different
   CPUs to maximise system utilisation

Classification of MMID Computers



# PARALLEL COMPUTER MEMORY ARCHITECTURES

- · shared memory
- · Distributed memory
- · Hybrid Distributed-shared Memory
- · Note: here memory is cache and not main memory





- · All CPUs share memory
- · Shared memory parallel computers
- · All processes can access memory as global address space
- · Processors operate independently but share memory resources
- Two classes based on access times: UMA (Uniform Memory Access) and NUMA ( Non-Uniform Memory Access)

#### (i) Uniform Memory Access CUMA)

- Today's symmetric Multiprocessor machines (SMPs)
- same time for all CPUs (processors) to access memory
- Also called CL-UMA (cache-coherent UMA)
- If one processor updates location in shared memory (cache), all processors know about update
- cache coherency accomplished at hardware level



# (ii) Non-Uniform Memory Access (NUMA)

- Often: linking 2 or more SMPs
  Memory access time across link is slow
  Cache coherency maintained (CC-NUMA)



# advantages

- · Global address space is user friendly for memory
- · Fast data sharing between tasks

# disadvantages

- Lack of scalability (more CPUs => geometrically more traffic on shared path
- Responsibility on programmer to ensure correct access to global memory
- · Expensive to produce for more no. of processors

# (b) Distributed Memory Prochitecture



- · Each CPV has its local memory but can access all
- · communication network to connect inter-processor memory
- · No global address space; no mapping of memory addresses
- · No concept of cache coherency
- · Programmer's responsibility to access data from another processor
- Synchronisation between tasks programmer's responsibility (initiate communication, handshake, free network etc.)
- Network fabric varies; can be ethernet

# advantages

- · Scalable memory with increase in no. of processors
- · Rapid access to processor's memory ; no cache coherency overhead
- · cost effective; can use off-shelf processors and networking

# disadvantages

- Programmer responsible for data communication between processors
- · NUMA times
- · could be hard to map data structures based on global memory

#### (c) Hybrid Architecture



- Shared memory component: cache-coherent SMP machine
- · All processors on given SMP access memory as global
- · Distributed component is networking of multiple SMPs
- · SMPs only know about their memory
- · Network communications must transfer data between SMPs

# PARALLEL PROGRAMMING LANGUAGES

· Shared memory APIs : Open MP --- C/C++, Fortran, Python



Distributed memory APIs: MPI (Message Passing Interface) —
 C/C++, Fortran, Java, Python, R, Ocaml etc.



Hybrid memory : combination of Open MP and MPI



# Shared Memory Programming

```
· Open MP API in C
```

```
#include <omp.h>
#include <stdio.h>
```

```
/*
To compile in MacOS: gcc -Xpreprocessor -fopenmp sharedmem.c -lomp -o <output_file>
To compile in Linux: gcc -fopenmp sharedmem.c -o <output_file>
*/
int main(int argc, char** argv) {
    int iam, np;
    #pragma omp parallel default(shared) private(iam, np)
    {
        np = omp_get_num_threads();
        iam = omp_get_thread_num();
        printf("Hello from thread %d out of %d\n", iam, np);
    }
    return 0;
}
```

## Executed in MacOS

| 11e   | 44 6   | acc V    |      |     |      |      | fanonmo   | char   | admom c    | lomp   | ~  | charadmom |  |
|-------|--------|----------|------|-----|------|------|-----------|--------|------------|--------|----|-----------|--|
|       |        |          |      |     | 2550 | NE - | -ropeninp | Silare | equilem. C | - comp | -0 | sharedmem |  |
| ⇒ Ur  | 11t 5  | ./share  | edme | em  |      |      |           |        |            |        |    |           |  |
| Helld | o from | 1 thread | d 3  | out | of   | 8    |           |        |            |        |    |           |  |
| Hello | o from | 1 thread | 0 b  | out | of   | 8    |           |        |            |        |    |           |  |
| Hello | o from | 1 thread | d 4  | out | of   | 8    |           |        |            |        |    |           |  |
| Hello | o from | 1 thread | d 2  | out | of   | 8    |           |        |            |        |    |           |  |
| Hello | o from | thread   | d 6  | out | of   | 8    |           |        |            |        |    |           |  |
| Hello | o from | thread   | d 1  | out | of   | 8    |           |        |            |        |    |           |  |
| Hello | o from | thread   | d 5  | out | of   | 8    |           |        |            |        |    |           |  |
| Hello | o from | thread   | d 7  | out | of   | 8    |           |        |            |        |    |           |  |
|       |        |          |      |     |      |      |           |        |            |        |    |           |  |

# Executed in Ubuntu 20

vibhamasti@ubuntu:~/Personal/CS 4/MPCA\$ gcc -fopenmp sharedmem.c -o sharedmem vibhamasti@ubuntu:~/Personal/CS 4/MPCA\$ ./sharedmem Hello from thread 0 out of 2 Hello from thread 1 out of 2

# Distributed Memory Programming

· MPI API in C

#include <mpi.h>
#include <stdio.h>

/\*

```
Compile on MacOS and Linux: mpicc distmem.c -o <output_file>
Execute on MacOS and Linux: mpirun ./<output_file>
*/
```

```
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int numprocs, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Hello from rank %d out of %d processors\n", rank, numprocs);
    MPI_Finalize();
}
```

### Executed on MacOS

| + Un: | it 5 m | picc  | d  | istme | em. | с- | -o distmem |  |
|-------|--------|-------|----|-------|-----|----|------------|--|
| + Un: | it 5 m | pirur | ١. | /di   | stm | em |            |  |
| Hello | from   | rank  | Ø  | out   | of  | 4  | processors |  |
| Hello | from   | rank  | 1  | out   | of  | 4  | processors |  |
| Hello | from   | rank  | 2  | out   | of  | 4  | processors |  |
| Hello | from   | rank  | 3  | out   | of  | 4  | processors |  |

# Executed on Ubuntu

vibhamasti@ubuntu:-/Personal/CS 4/MPCA\$ mpicc distmem.c -o distmem vibhamasti@ubuntu:-/Personal/CS 4/MPCA\$ mpirun ./distmem Hello from rank 0 out of 1 processors

# • Both OMP and MPI

#include <mpi.h> #include <omp.h> #include <stdio.h>

```
int main(int argc, char** argv) {
   MPI_Init(&argc, &argv);
   int numprocs, rank;
   MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   int iam, np;
   #pragma omp parallel default(shared) private(iam, np)
       np = omp_get_num_threads();
       iam = omp_get_thread_num();
       printf("Hello from thread %d out of %d and rank %d out of %d processor\n", iam, np, rank, numprocs);
```

```
MPI_Finalize();
```

# Executed in Ubuntu

vibhamasti@ubuntu:~/Personal/CS 4/MPCA\$ mpicc -fopenmp hybrid.c -o hybrid vibhamasti@ubuntu:~/Personal/CS 4/MPCA\$ mpirun ./hybrid Hello from thread 0 out of 2 and rank 0 out of 1 processor Hello from thread 1 out of 2 and rank 0 out of 1 processor

# Executed in MacOS

→ Unit 5 mpicc -Xpreprocessor -fopenmp hybrid.c -lomp -o hybrid
 → Unit 5 mpirun ./hybrid
 Hello from thread 0 out of 8 and rank 1 out of 4 processor
 Hello from thread 1 out of 8 and rank 1 out of 4 processor
 Hello from thread 2 out of 8 and rank 1 out of 4 processor
 Hello from thread 4 out of 8 and rank 1 out of 4 processor

: (truncated)

# WITH GPU

- · Cannot use only OMP & MPI
- · CUDA



# Single Program Multiple Data



Multiple Program Multiple Data





Architectural Innovations for Improved Performance

|                        | Architectural me                                                                                                                                                                                                                        | thod                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | Improveme                                                | nt factor                |  |  |  |  |  |  |  |  |  |
|------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------|--------------------------|--|--|--|--|--|--|--|--|--|
| Established<br>methods | 1. Pipelining (an<br>2. Cache memo<br>3. RISC and rela<br>4. Multiple instru                                                                                                                                                            | ry, 2-3 levels<br>ated ideas                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | 5                                                        | 3-8<br>2-5<br>2-3<br>2-3 |  |  |  |  |  |  |  |  |  |
| Newer<br>methods       | 5. ISA extensions (e.g., for multimedia)1-36. Multithreading (super-, hyper-)2-5 ?7. Speculation and value prediction2-3 ?8. Hardware acceleration2-10 ?9. Vector and array processing2-10 ?10. Parallel/distributed computing2-1000s ? |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |                                                          |                          |  |  |  |  |  |  |  |  |  |
| Different              | Types of Multicore                                                                                                                                                                                                                      | Architecture                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |                                                          |                          |  |  |  |  |  |  |  |  |  |
|                        | CPU State<br>Interrupt Logic<br>Execution<br>Units Cache                                                                                                                                                                                | Execution Cache                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | CPU State<br>Interrupt Logic                             |                          |  |  |  |  |  |  |  |  |  |
|                        | A) Single Core                                                                                                                                                                                                                          | B) Multipro                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | cessor                                                   |                          |  |  |  |  |  |  |  |  |  |
|                        | CPU State         CPU State           Interrupt Logic         Interrupt Logic           Execution<br>Units         Cache                                                                                                                | CPU State<br>Interrupt Logic<br>Execution<br>Units Cache                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | CPU State<br>Interrupt Logic<br>Execution<br>Units Cache | single<br>chip           |  |  |  |  |  |  |  |  |  |
|                        | C) Hyper-Threading Technology                                                                                                                                                                                                           | D) Multi                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | i-core                                                   |                          |  |  |  |  |  |  |  |  |  |
|                        | CPU Sta<br>Interrupt L<br>Execution t                                                                                                                                                                                                   | Interrupt Logic                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |                                                          |                          |  |  |  |  |  |  |  |  |  |
|                        | E) Multi-co                                                                                                                                                                                                                             | re with Shared Cache                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |                                                          |                          |  |  |  |  |  |  |  |  |  |
|                        | CPU State CPU State<br>Interrupt Logic Interrupt Logic CPU State                                                                                                                                                                        | and a support in the support of the | CPU State<br>Interrupt Logic<br>Cache                    |                          |  |  |  |  |  |  |  |  |  |
|                        | F) Multi-core with I                                                                                                                                                                                                                    | hyper-Threading Technolog                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | IY                                                       |                          |  |  |  |  |  |  |  |  |  |

# Thread-Level Parallelism

· SMT Dual-Core



Thread 1 Thread 3

Thread<sup>1</sup><sup>4</sup> Thread 4

# Instruction - Level Parallelism

i) Scalar

| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|----|
|   |   |   |   |   |   |   |   |   |    |
|   |   |   |   |   |   |   |   |   |    |
|   |   |   |   |   |   |   |   |   |    |
|   |   |   |   |   |   |   |   |   |    |
|   |   |   |   |   |   |   |   |   |    |
|   |   |   |   |   |   |   |   |   |    |
|   |   |   |   |   |   |   |   |   |    |
|   |   |   |   |   |   |   |   |   |    |
|   |   |   |   |   |   |   |   |   |    |
|   | ļ |   |   | L |   | L |   |   |    |

# (ii) Pipelining

| 1 | 2 | 3                 | 4         | 5 | 6 | 7     | 8 | 9 | 10      |
|---|---|-------------------|-----------|---|---|-------|---|---|---------|
|   |   |                   |           |   |   |       |   |   |         |
|   |   |                   |           |   |   |       |   |   |         |
|   |   |                   |           |   |   |       |   |   |         |
|   |   |                   |           |   |   |       |   |   | -       |
|   | c | s <mark>se</mark> | i         |   |   |       |   |   |         |
|   |   |                   |           |   |   |       |   |   |         |
|   |   |                   |           | - |   |       |   |   |         |
|   |   |                   |           |   |   |       |   |   | -       |
|   | - |                   | · · · · · |   |   | 8 - E |   |   |         |
|   |   |                   |           | = |   |       |   |   | i.<br>Ž |
|   |   |                   |           |   |   |       |   | ; | -       |

- (iii) Super-Scalar · multiple fetch at once ; more than 1 inst per CC · parallel execution units

| 1    | 2 | 3   | 4                           | 5        | 6  | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14   | 15 |
|------|---|-----|-----------------------------|----------|----|---|---|---|----|----|----|----|------|----|
|      |   |     | 1.<br>T                     |          |    |   |   |   |    |    |    |    | 66.ª |    |
|      |   |     |                             |          |    |   |   |   |    |    |    |    |      |    |
|      |   |     |                             |          |    |   |   |   |    |    | -  |    |      |    |
| 0    |   |     |                             |          | 7  |   | - |   |    |    |    |    |      |    |
| Thir |   |     | <b>g follov</b><br>(=6;i++) | ving loo | op |   |   |   |    |    |    |    |      |    |
|      |   | Out | =i+i;                       |          |    |   | - | - |    |    |    | -  |      |    |

# (iv) Super Pipelining

- increase depth (more pipeline etages)
  0.5 cc for one stage or 2 stages per cc, here



#### (Hardware) Dynamic Parallelism





# VLIW: Very Long Instruction Word

- Multiple independent instructions bundled together as a single long instruction
- · Done by compiler

· Different parts in parallel

# VLIW vs Superscalar

 Superscalar: each instruction fetched (multiple fetch) and executed in parallel



- VLIN: decode <u>single</u> instruction into multiple instructions and execute parallelly
- · Compiler identifies independent instr & bundles



# Drawback of VLIW

- No independent instructions found: no-ops inserted and recompile
- Improvement: EPIC- Explicit Parallel Instruction Computer Cuces speculative loading & predictions>

# EPIC

- · 64-bit microprocessor instruction set
- · 128 general & floating point unit registers
- Speculative loading: fetch instruction and execute but do not change memory content until branch decision known
- Prediction: fetch but may not execute as branch decision is known
- · Uses speculative loading, prediction and explicit parallelism

# Jeatures of EPIC

- · Group of instructions bundle
- Each bundle has stop bit : if subsequent instruction bundle depends on it
- · Dependency information determined by compiler, not hardware
- · Prefetching instructions: software prefetch instruction

 Check load instruction: checks whether a speculative load was dependent on a later store instruction and must be reloaded

- Look at ppt references

ADVANCEMENT in PARALLEL COMPUTING

Adding Two Matrices

is Uniprocessor





Uni Processor

# (ii) Multiprocessor (4)



# SPEEDUP

- Ts: best possible serial time
- · T<sub>n</sub> : time taken by parallel algorithm on n processors



# Parallel Execution of a Program

- Program can be split into two parts: parallelisable and non-parallelisable parts
- T = total time of serial execution
- · F = total time of parallelisable part when executed serially



· when executed parallely with n processors

total execution time = 
$$(T-F) + \frac{F}{D}$$

S: The total time to execute a program is set to 1. The parallelizable part of the programs consumes 60% of the execution time. What is the execution time of the program when executed on 2 processors?

$$T = | F = 0.6 N = 2$$

execution time = (1 - 0.6) + 0.6 = 0.4 + 0.3

า

Parallel computing - Processor Topology

1. Linear







3. Ring







# 5. Hypercube



- Look at slides for hypercube & mesh examples

AMDAHL'S LAW

speedup = 
$$\frac{T_s}{T_N}$$
 =  $\frac{T_c}{(T_s - F) + \frac{F}{N}}$  =  $\frac{T_s}{(1 - f)T_s + \frac{F}{N}}$   
speed up =  $\frac{1}{(1 - f) + \frac{F}{N}}$ 

if n = speedup factor s, more generally

$$speedup = \frac{1}{(1-f) + \frac{f}{s}}$$

s=speedup factor

f = fraction of program that can be optimised

1-f = fraction of program that cannot be optimised

 $\mathsf{Limit}: \mathsf{as} \quad \mathsf{s} \to \infty$ 

**Q**: What is the overall speed up if 10% of the program is made 90 times faster?



|  |  |  | <b>U</b> I | • |    |   |  |  |
|--|--|--|------------|---|----|---|--|--|
|  |  |  |            |   | 90 | 5 |  |  |
|  |  |  |            |   |    |   |  |  |

= 1.11

S: What is the overall speed up if 90% of the program is made 10 times faster?

| speedup = _       |          | 1   |        |       |     |
|-------------------|----------|-----|--------|-------|-----|
|                   | (1-f)    | • + | f<br>S |       |     |
|                   |          |     | S      |       |     |
| f = 0.9<br>s = 10 |          |     |        |       |     |
| 8 - 10            |          |     |        |       |     |
|                   |          |     |        |       |     |
| speedup           | <u> </u> |     | I      | <br>= | 100 |
|                   |          | 0.1 | + 0.9  |       | 19  |
|                   |          |     | 10     |       |     |
|                   |          |     |        |       |     |
|                   | = 5.     | 26  |        |       |     |

 $\max \text{ speedup } (f) = \frac{1}{1-f}$ 

Amdahl's Law 20 18 Parallel portion 16 50% ----- 75% 14 90% 95% 12 Speedup 10 8 6 4 2 0 2048 4096 8192 16384 32768 2 00 16 32 54 L28 256 512 L024 65536 Number of processors source : Wikipedia (i) f= 0.5 => speedupmax = 2 (ii) f=0.75 => speedupmax = 4 (iii)  $f=0.90 \Rightarrow$  Speedup max = 10 (iv)  $f=0.95 \Rightarrow$  Speedup max = 20

#### Flaws with Amdahi's Law

- · Assumes speedup independent of problem size
- · Does not account for scalability
- · Ignores communication cost

## GUSTAFSON'S LAW

- The proportion of sequential computations decreases as the problem size increases
- · Not theorem; observation
- · Assume parallel execution time fixed

speedup factor = N - (N-1)\*S

- s = serial part of code (fraction)
- Increase no. of processors as well as program size
- Q: Suppose a program has serial section of 5% and 20 processors. Find speedup according to Amdahl's and custafson's laws.

(i) Amdahl's law

$$speedup = 1 = \frac{400}{0.05 + 0.95} = 10.26$$

ಎಂ

## (ii) Guitafson's law

# speedup = 20 - (19) × 0.05

= 19.05

#### MULTICORE PROCESSORS

· Frequency limit; parallelisation required



## Limitations of single core

- · Power wall (heat)
- · Memory wall (mem access latency)
- ILP instruction level parallelism wall cdependency, instruction window size for fetch)

# Single Core CPU CHIP



- · Can be considered as one thread
- Single-threaded processor
  Register, PC, SP, interrupt logic, execution unit, cache

## multi-core CPU CHIP



- chip multiprocessing CCMPD •
- Multicore with hyperthreading
  MIMD different cores execute different threads operating on different parts of memory
- · Shared memory multiprocessor

#### Multi-core Architecture

· One thread in each core

| thread 1 | thread 2 | thread 3 | thread 4  |  |  |
|----------|----------|----------|-----------|--|--|
|          |          |          |           |  |  |
|          |          |          |           |  |  |
| с<br>0   | с<br>0   | C<br>O   | C<br>O    |  |  |
| r        | r        | r        | r         |  |  |
| е        | е        | е        | е         |  |  |
| 1        | 2        | 3        | 4         |  |  |
|          |          |          |           |  |  |
|          |          |          |           |  |  |
|          |          |          | · · · · · |  |  |
| 4        |          |          |           |  |  |

#### · Several threads in each core

| c<br>o<br>r<br>e | c<br>o<br>r<br>e | C<br>O<br>T<br>e | C<br>O<br>T<br>e |
|------------------|------------------|------------------|------------------|
| 1                | 2                | 3                | 4                |

AMDAHL'S LAW FOR MULTICORE PROCESSORS

$$speedup = \underbrace{I}_{F \times R}$$

$$\underbrace{I-F}_{Perf(R)} + \underbrace{F \times R}_{Perf(R) \times N}$$

Homogeneous & Heterogeneous Multiple Core Architecture

- · Identical processor cores: same instruction set architecture
- · Non-identical processor cores: different ISAs
- · Eg: ISAs for GPU & CPU processors



## Roles

#### • User

- use threads/processes
- spread workload
- write parallel algorithms

#### • OS

- maps threads to cores
- each core perceived as processor
- major Oses support multicore

#### · Memory

- memory contention (bandwith shared for communication and computation)
- memory refers to cache
- cache coherence protocols
- RAM on chip: M1 (Unified Memory Architecture UMA)
- False Sharing: shared cache; if multiple processors accessing same cache line/block to write, lots of unnecessary bus traffic for cache coherence ccpu cache line interference)



# Players

- tasks User .
- 20 distributed .
- Compiler Hardware .
- •





Hardware